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Abstract 

The presence of groups containing high leverage outhers makes linear re- 
gression a difficult problem due to the masking effect. The available high 
breakdown estimators based on Least Trimmed Squares often do not suc- 
ceed in detecting masked high leverage outliers in finite samples. 

An alternative to the LTS estimator, called Penalised Trimmed Squares 
(PTS) estimator, was introduced by the authors in [28ll29j and it appears 
to be less sensitive to the masking problem. This estimator is defined by 
a Quadratic Mixed Integer Programming (QMIP) problem, where in the 
objective function a penalty cost for each observation is included which 
serves as an upper bound on the residual error for any feasible regression 
line. Since the PTS does not require presetting the number of outliers 
to delete from the data set, it has better efficiency with respect to other 
estimators. However, due to the high computational complexity of the 
resulting QMIP problem, exact solutions for moderately large regression 
problems is infeasible. 

In this paper we further establish the theoretical properties of the 
PTS estimator, such as high breakdown and efficiency, and propose an 
approximate algorithm called Fast-PTS to compute the PTS estimator 
for large data sets efficiently. Extensive computational experiments on 
sets of benchmark instances with varying degrees of outlier contamination, 
indicate that the proposed algorithm performs well in identifying groups 
of high leverage outliers in reasonable computational time. 



Keywords: Robust regression; quadratic mixed integer programming; 
least trimmed squares; outliers detection. 
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1 Introduction 

In multilinear regression models experimental data often contains outliers and 
bad influential observations, due to errors. It is important to identify these 
observations and eliminate them from the data set, since they can lead the 
regression estimate to take erroneous values. If the data is contaminated with a 
single or few outliers the problem of identifying such observations is not difficult, 
but in most cases data sets contain groups of outliers which makes the problem 
more difficult due to masking and swamping effects. An indirect approach to 
outlier identification is through a robust regression estimate. If a robust estimate 
is relatively unaffected from outliers, then the residuals from the robust fit 
should be used to identify the outliers. 

Two commonly used criteria for robustness are the breakdown point and 
efficiency of the estimator. The breakdown point can be roughly defined as the 
minimum percentage of outliers present in the data, that could lead the error 
between the robust estimate and the hypothetical true estimate to be infinitely 
large. It is desirable for a robust estimator to have a breakdown point of close 
to 50% of the sample size. 

A well known estimator with high breakdown point is the Least Trimmed 
Squares (LTS) estimator of Rousseeuw and Leroy 22J. The method consists of 
finding a n — k subset of observations whose deletion from the data set would 
lead to the k smallest residual sum of squares. However, it is well known that 
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the LTS loses efficiency. Several robust estimators have been proposed in the 
literature as extensions to the LTS, to obtain high breakdown point and simul- 
taneously improve the efficiency. Among them are the S-estimator proposed 
by Rousseeuw and Yohai [13], the MM-estimator proposed by Yohai [55], the 
r— estimator proposed by Yohai and Zamar [57], the SIS estimator proposed by 
Coakley and Hettmansperger [^, WLSE computed from an initial high break- 
down robust estimator by Agostinelli and Markatou [1] and the REWLS by 
Gervini and Yohai |9]. Most of these estimators start from an initial robust 
scale a LTS and robust regression coefficient estimate ^lts given from the LTS 
estimator with coverage A: ~ [{n + p + l)/2] ~ 50%. Then they compute the 
standardised residuals using different schemes based on robust scale (Jlts and 
design weights with appropriate cut-off values, and they reconsider which of the 
potential outliers should remain in the basic subset as clean data and which 
should be eliminated as outliers. Thus, they simultaneously attain the maxi- 
mum breakdown point of the LTS estimator while improving its efficiency. The 
key to the success of all the aforementioned methods is to start with a good 
initial regression coefficient estimate (3lts- When a finite sample contains mul- 
tiple high leverage points, these may be masked outliers and would affect the 
LTS estimator resulting in a biased initial estimate. Moreover, the LTS method 
requires a priory knowledge of the coverage k, or equivalently the number n — k 
of the most likely outliers that produces the largest reduction in the residual 
sum of squares when deleted. Unfortunately, this knowledge of k is typically 
unknown. Gentleman and Wilk [8j . Computation of the LTS estimator requires 
the solution of a hard combinatorial problem, and there have been many exact 
and approximation algorithms proposed in the literature (see Section \2.2\ . 

A different approach to obtain a robust estimate has been proposed in [29j 
called Penalised Trimmed Squares (PTS), which does not require presetting 
the number n — A: of outliers to delete from the data set. The PTS approach 
"trims" outliers from the data but instead of discarding a fixed number of ob- 
servations, a fixed threshold for the allowable size of the adjusted residuals is 
used. The new estimator PTS is defined by minimising a loss function^ which 
is the sum of squared residuals and penalty costs for deleting bad observations. 
These penalties regulate the threshold of the allowable adjusted residuals, as 
well as the coverage. In order to overcome the problem of groups of masking 
outliers containing almost identical high leverage points, lower penalties are pro- 
posed yielding a smaller adjusted residual threshold for such observations. These 
penalties are a function of robust leverages resulted from the MCD estimator 
by Rousseeuw and Van Driessen 20j. Computationally, the PTS estimator also 
involves the solution of a hard combinatorial problem. 

The purpose of this paper is twofold. First, the robust properties of the 
PTS estimator as presented in [29j such as equivariance, exact fit property, high 
breakdown point and efficiency, are established. Secondly we present an efficient 
approximation algorithm called Fast-PTS, to compute the PTS estimator for 
large data sets without having to solve the problem exactly. 

The organisation of the paper is the following. Section [2] contains some 
preliminary notations and definitions. The definition of the PTS estimator, its 
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robust properties, the computation of the penalties and finally an equivalent 
mixed integer quadratic programming formulation are presented in Section [3l 
Section 2] describes the Fast-PTS algorithm, where initially a set of necessary 
optimality conditions is given, and the algorithm is then presented in pseu- 
docode. Extensive computational experiments to examine the performance of 
the Fast-PTS algorithm with respect to, both the quality of the approximate 
solution when compared to the exact solution of small instances, and the robust 
performance to a set of benchmark and artificially generated large instances, are 
given in Section O Finally, concluding remarks and further possible extensions 
to both the algorithm and the robust estimator are given in the last section. 



2 Trimmed Squares Regression 
2.1 Preliminaries 

In this section we will state some well established facts in order to set up the 
notation that will be used for the rest of the paper. We consider the multi-linear 
regression model with p independent variables 

y = X/3 + u, (1) 

where y = (yi, y2, ■ ■ ■ , Vn)^ is the n x 1 vector of the response variable, X is a 
full rank n x p matrix with rows = [xn , Xi2^ . . . , Xip) of explanatory variables, 
/3 is a p X 1 vector /3 = (32, ■ ■ ■ , Pp)'^ of unknown parameters , and u is a 
n X 1 vector u = (ui, U2, ■ ■ ■ , Unf" of iid random errors with expectation zero 
and variance cr^. We observe a sample (xii,Xi2, . . . , Xip,yi), for i = 1, 2, . . . , n, 
and construct an estimator for the unknown parameter /3. The Ordinary Least 
Squares Estimator (OLS) is defined by minimising the squared residual loss 
function 

n 

0L5(X,y) := argmin ^ r(/3)2 (2) 
i=i 

where r(/3)i is the regression residual r{^)i := yi—XifS. We will write instead 
of r{P)i whenever the parameter vector f3 need not be explicitely stated. It is 
well known that a solution to ^ is obtained in polynomial time (i.e. 0{n^)) 
by the normal equations 

0L5(X,y)-(X^X)-iX^y 

A transformation of the residuals that will be useful in this work is the 
adjusted residual ai which is defined as 



(3) 



where hi (0 < hi < 1) measures the leverage of the i*^ observation and is 
defined as hi :— xf (X^X)~^Xi. The reduction in the sum of squared residuals 
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in ^ due to the deletion of an observation (x^, yi) is the square of its adjusted 
residual 



Unfortunately, points that are far from the predicted line (outliers) are overem- 
phasised. There are several types of outliers that can occur in a data set (X, y). 
Following the terminology of Rousseeuw and Van Zomeren [21], a point (x;, yi) 
which does not follow the linear pattern of the majority of the data but whose 
Xi is not outlying is called a vertical outlier. A point (x^, yi) whose x^ is outlying 
(large hi) is called a leverage point. It is called good leverage point when (xj, yi) 
follows the pattern of the majority otherwise it is called bad leverage point. The 
OLS estimator is very sensitive to outliers, in the sense that the presence of 
any type of previously mentioned types of outliers greatly affects the solution 
of We wish to construct a robust estimator for the parameter (3, such that 
the influence of any observation (xi^yi) on the sample estimator is bounded. 



2.2 Least Trimmed Squares 

Rousseeuw introduced the Least Trimmed Squares (LTS) estimator in [19], 
which fits the best subset of k observations, removing the rest n — k obser- 
vations. The LTS estimator is defined as follows defined as: 

k 

LTS{X, y,k):^ arg min ^ r (/3)^^) (4) 
^ i=i 

s.t. r{f3)l) < r{(3)l^ <■■■< r(/3)2„) 

where k is called the coverage, and is chosen as A; > [{n + p + l)/2] a priori, so 
as to maximise the breakdown point. In ([4]) and in what follows, we employ the 
convention of writing r(/3)(j) for the smallest residual error with respect to 
/3. The LTS estimator has high breakdown point but loses efficiency, since n — k 
observations have to be removed from the sample even if they are not outliers. 
Assuming that the set of points (xi,yi) are in general position, that is any two 
distinct subsets of p points have different regression lines, and observing that 
for a given k the problem in ^ consists of solving (^) ordinary least squares 
problems, it is immediate that it has a unique solution. 

Problem ^ has been formulated as a nonlinear programming problem with 
linear constraints and non-convex objective function by Giloni and Padberg 
in [TD], thereby its local minima are well characterised by the Karush-Kuhn- 
Tucker optimality conditions. In particular the authors in [10] establish its 
equivalency with a concave minimisation problem, which is known to be NP- 
Hard, and provide a procedure for computing local minima. The exact compu- 
tation of (U requires an exponential number of steps with respect to the size of 
the problem, therefore it can only be applied to small instances. AguUo in [2] 
presented a branch and bound procedure for the exact computation of @ along 
with several procedures to reduce the computational effort, and managed to find 
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the global optimum for several benchmark instances of size n < 50. Polynomial 
time algorithms for the exact computation of the LTS for simple {p = 1) lin- 
ear regression problems are given by Hossjer [15j . and recently by Li jl6| who 
utilises advanced data structures to attain improved computational complexity. 
Due to the high computational complexity of the LTS estimator there have been 
a plethora of approximation algorithms appeared in the literature with varying 
success. We mention among others, the Forward Search algorithm by Atkin- 
son [31 3] , the Feasible Solution algorithm by Hawkins [13] and the more 
recent Fast-LTS algorithm by Rousseeuw and Van Driessen [5T] which currently 
the most accurate and efficient algorithm. 

3 Penalised Trimmed Squares 

The basic idea of the PTS estimator is to insert fixed penalty costs for each 
observation into the objective function of such that only observations whose 
adjusted residuals is larger than their penalty costs are deleted from the data 
set. Therefore instead of defining a priori a coverage k, the penalty costs are 
defined and the coverage becomes a decision variable. In the sections that follow 
suitable penalties for multiple high-leverage outliers are proposed. We consider 
as most likely outliers that subset of the observations that produces significant 
reduction in the sum of squared residuals when deleted. The proposed PTS 
estimator can be defined through a minimisation problem, with an objective 
function split into two parts; the sum of k squared residuals in the clean data 
and the sum of the penalties for deleting the rest n — k observations. 

k n 

PT5(X, y, p) arg min J] r{f3)f,^ + ^ (5) 

' i=l i=k+l 

S.t. r{/3)l^ < r{f3)f,^ <■■■< r(/3)2„) 

where p = (pi, . . . ,p„) and pi is the penalty cost for deleting the observation 
defined as 

Pi max{e, (cct)^} 

for some small e > 0, where ct is a robust residual scale and c is the cut-off 
parameter. Although these penalties are constant in the basic definition of the 
PTS, as we shall see later in section each observation will have an individual 
penalty. The estimator performance is very sensitive to the penalties defined 
a priori which regulate the robustness and the efficiency of the estimator. The 
choice of the robust scale & plays an important role in the coverage of the PTS 
estimator. The minimisation problem ([5]) is not convex since it is equivalent 
to a quadratic mixed integer programming problem, nevertheless there exists a 
unique solution under mild assumptions on the data (see section [3. 4p . 

Let us define with fpTs{f3,k) the objective function of ([5]). We have the 
following straightforward observations regarding the optimisation problem in 
([5]), where the set of feasible solutions is all (/3, k) for (3 gMP and fc = 0, . . . , n: 
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i) For any feasible solution (/3, k) there exists an associated set of k obser- 
vations as determined by the ordering 

r(/3)^i) < r(/3)f2) < • • • < r{f3)f,^ <■■■< r{/3)ly 

ii) For any (3 eMP there exists k* such that 

/pT5(/3, k*) < fpTsif3, k), Vfc = 0, . . . , n, 

where k* is determined by those observations with squared residuals w.r.t. 
(3 less than their respective penalties. Therefore at optimality, (3 G MP 
determines k uniquely. 

iii) Since the second term of the objective function in ([5]) is independent of /3, 
for any feasible solution (/3, fc) 

fpTs{f3oLS,k) < fpTs{0-,k), 

where (3ols is the ordinary least squares estimate obtained for the set of 
observations so defined by {fi.k). 

We summarise the above in the following lemma which will be used later. 

Lemma 3.1. // {f3pTSi kpTs) the solution obtained by 0) for sample data 
(X, y) then 

OLS(XkpTSiykpTs) = f^PTS, 

where (X-kpTs ^ykpTs) the submatrix and subvector o/(X,y) respectively, as 
determined by the set of kpxs observations defined by {fipTSikpTs)- 

In view of the above lemma and ([3]) wc can now state the general principle 

of the PTS estimator, that is to delete an observation i if its squared adjusted 
residual is larger its the penalty cost 

a' > (c<t)2. (6) 

Therefore, the adjusted residual has as a threshold the square root of the deleting 
penalty ca. Given that the PTS estimate is the OLS of the clean data set, the 
resulted adjusted residuals must be smaller or equal than the penalties 

: < CO-, Q<i<k. (7) 



where h* is the leverage of the i*'' observation in the k observations. Otherwise, 
the observation can not remain in the coverage sample due to the PTS 
principal. 

The PTS as defined in ([5]) "trims" outliers from the data but instead of dis- 
carding a fixed number of observations, a fixed threshold {ca) for the allowable 
size of the adjusted residuals is used. 
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3.1 Properties of the PTS 

Most of the robust properties of the PTS estimator are inherited by the use of 
the robust scale (t, which in our case will be obtained from the LTS estimator 
with coverage (n + p + l)/2. In what follows we will employ the notation used 
in [12]. 

Let T(X,y) be an estimator for some sample data (X,y). An estimator T 
is called regression equivariant if 

r(X,y + Xv) =T(X,y)+v, for any v G R", 

scale equivariant if 

r(X, cy) - cT(X, y), for any c e M, 
and affine equivariant if 

T(XA, y) = A-1T(X, y), for any nonsingular A G 
Lemma 3.2. The PTS estimator is regression, scale and affine equivariant. 

Proof. For regression equi variance, by Lemma |3. II there will exist some subset 
of k observations were 

PT5(X, y + Xv,p) = (X^Xfe)-iXnyfc+Xfcv) 
= (X^X,.)-^Xjy,+v 
= Pr5(X,y,p)+v. 

Similarly for scale and affine equivariance. □ 

The PTS estimator determines that subset of observations whose deletion 
produces the largest reduction in the sum of squared residuals. The LTS estima- 
tor determines the subset of size k with the minimum sum of squared residuals. 
The relationship between the two estimators can be seen by the following propo- 
sitions. 

Proposition 3.1. If the PTS estimator for sample data (X,y) converges to the 
solution i(3pTSikpTs), then LTS(X,y,kpTs) = f3pTS- 

Proof. Observe that any k observations will have the same sum of penalties, 
therefore we will have 



LTS{X,y,kpTs) = argmin ^ r(/3)(j) 



2 = 1 

(kpTS \ 

fipTS 

□ 
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Conversely under mild assumptions, from the LTS estimator given in Q and 
using the largest residual as the deleting penalty, the PTS estimator given 
in (O leads to the same results. 

Proposition 3.2. If for sample data (X, y) and coverage k we have LTS{^, y, k) — 
f^LTS while the largest residual r{(3LTs)k is an increasing function with re- 
spect to k, then for fixed penalty Pi = {r{/3LTs)k)^ ^ i — l,---,n, we have 
PTS(K,y,p)=f3LTS- 

Proof. Let PrS'(X,y,p) — {f3pTS, kpTs)- By Proposition 13. II enough to show 
that kpTs = k. Moreover the largest residual r{(3pTs)kpTs '^ill ^^^o be an 
increasing function with respect to kpxs- We will have for any k > kpxs 

^PTS n k k n k 

E = E^«'- E ^' + Ep*+ E Pi 

2—1 i—kpTS ^—1 i—kpTS i—k i—kpTS 

k n k 

= E^?+E^'^+ E ip^-'^l) 

2—1 i—k i—kpTS 

where the terms in the last summation are all nonnegative since > {r{f3LTs)i)'^ j 
i = kpTs, ■ ■ ■ ,k. Therefore the minimum is obtained for k = kpTS- Similarly 
for k < kpTs- n 

In the LTS estimator the coverage k has to be chosen between n/2 and n. 
For coverage k = [(n + p+ l)/2] the maximum breakdown point is obtained by 
loosing efficiency whereas for k = n the obtained breakdown value is reduced to 
1/n. As it is the case with most robust estimators, there is a trade-off between 
robustness and efficiency. The advantage of the PTS procedure is that there 
is no need to define a priory the coverage k. For given penalty (ctr)^, where 
tr is a high breakdown robust scale, the PTS can resist against all influential 
observations with large adjusted residuals (i.e. > ca). 

3.1.1 Exact Fit Property and Breakdown Point 

In this section we will investigate two common robust properties for the PTS 
estimator, namely the exact fit and breakdown points. The exact fit point of an 
estimator T is defined as the minimum fraction of observations on a hyperplane, 
necessary to guarantee that the estimator coincides with that hyperplane |27| . 
Formally 

,5*(r,X,y) := min{^ : 3(3 such that X,„/3 = y,„,T(X,y) = /s} , (8) 

where (Xm,ym) is the submatrix and subvector of (X,y) respectively, as de- 
termined by the set of m observations. A dual notion is used in |22| . where 
the exact fit point is taken as the minimum fraction of observations not on a 
hyperplane that will force the estimator to give a different estimate from that 
hyperplane. An estimator is said to have the exact fit property when S* < 0.5. 
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The breakdown point of an estimator T is defined as the minimum fraction 
of observations that could take arbitrary values and force the estimator to vary 
indefinitely from its original estimate. This is the finite version of breakdown 
point as it was introduced in [7] , and could be stated formally as 

e*(r,X,y) mini-: sup ||r(X, y) - r(X„„ y™)|| = oo 1 , (9) 

I " X,„,y,„ J 

where (Xm,ym) is the submatrix and subvector of (X,y) respectively, as de- 
termined by the set of m observations. The relationship between exact fit point 
and breakdown point for regression equivariant estimators is 

S* >l-e* (10) 

where under mild assumptions equality also holds. 

Proposition 3.3. If for sample data (X,y) there exists some f3 such that 
strictly more than "^2~^ "-^ observations satisfy x.il3 = and are in general 
position, then the PTS solution with robust scale cflts equals (3. 

Proof. Since the above is true for the LTS estimator (see corollary in page 134 
of [22]), we will have ctlts = 0. Therefore the penalties as defined in ([5]) will 
be p = (e, . . . , e) and at optimality it will hold 

PT5(X, y, p) = LT5(X, y, fc^Ts) = /3, 

iovkLTS = ^^- □ 

It is known |22j that for regression and scale equivariant estimators it holds 

* ^ n — p + 2 
^ ~ 2n ' 

Below we show that the PTS estimator has a high breakdown point. 

Proposition 3.4. The breakdown point of the PTS estimator with robust scale 
o'LTS is bounded from below as 

^ n — p — I 

'^^^ - 2n • 
Proof. By Lemma 13.21 and (fTU)) we know that 

5 PTS — 1 ^ ^PTS' 

while by Proposition 13.31 we get 

n + p+1 



^PTS — 



2n 



Therefore by the above 



> 1 



^^^^ 2n 2n 



p-1 

□ 
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3.1.2 Efficiency 



Let (X, y) be sample data where the random errors u as given in the model ([T]) 
follow the normal distribution, u ^ N(0,a^). 

Consider now that n ^ oo. We know from [22] that the variance u^ts ob- 
tained from the LTS estimator is consistent (i.e. lim„^oo o'lts — <^)- Moreover 
for each observation i we have that the leverage hi = Q, which means that by 
the PTS general principle an observation i will be deleted if 

a1 = r{(3oLsfi > ca^ 

Therefore, under Gaussian conditions on the random errors and for large n, 
we can say that the PTS estimator with robust scale a = glts and cut-off 
parameter c € [2, 3], deletes a very small fraction of observations as outliers. 



3.2 Penalties Computation 
3.2.1 Robust scale estimate a 

The key to the success of the PTS estimator is to use a robust scale a for 
evaluating the proper penalties. This robust scale a is obtained via the LTS 
estimator with coverage fc^TS = [(^+P+l)/2] as follows. Initially a preliminary 
error scale s is computed 



Cfcr 



— ^ r{f3LTs)ly 



kLTS ^ 



where the constant Ck,:^Ts,n is so chosen in order to make the scale estimation 
consistent at the Gaussian model, that is 



CkLTS,n = 1/Wl - 1 ■!> 



where 

^kLTS,n 



Then we improve the efficiency results of the LTS by applying a re weighted 
procedure (see [22]). First the standarized residuals r{(3LTs)i/ s are computed, 
and they are used to determine a weight for each observation as follows: 

1 if < 2.5, 

otherwise. 

The final robust scale is then given by 
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3.2.2 Unmasking Multiple High Leverage Outliers 

Based on the PTS principal, an observation is deleted if its adjusted residual is 
greater than the square root of the penalty cost > or. For y-outliers and even 
for some x-outhers such an approach has successful performance. Unfortunately 
the masking problem arises when there is a group of high leverage points in the 
same direction. As it is known, the leverage value hi can be distorted by the 
presence of a collection of points, which individually have small leverages but 
collectively form a high leverage group. Peha and Yohai [T^ point out that the 
individual leverage hi of each point might be small hi << 1, whereas the final 
residual may appear to be very close to 0, and this is a masking problem. Thus 
in a set of identical high leverage outliers, the adjusted residual is masked and 
might be too small << Cia, and as a consequence a masked high leverage 
outlier may not be deleted. 

In order to overcome the distortion caused by the masking problem and 
unmask the leverage hi of the potential high leverage points, we are based on 
the Minimum Covariance Determinant (MCD) procedure (Rousseeuw and Van 
Driessen [2Q])- For given coverage k, the MCD procedure yields the "clean" 
matrix where n — k vectors x^ corresponding to high leverage observations have 
been removed from the original matrix X. Thus, for coverage k close to 51% 
(i.e. k = [{n +p + l)/2]), the leverage of each point (x^, yi), i = fc + 1, . . . , n as 
it joins the clean data subset taken from MCD is 

= (X^jXi.^j)"^Xi, for i = k + l,...,n, 

where li-k.i is the matrix X/c plus the row x^. In the above h* is the unmasked 
leverage of each one of the potential high leverage points which can be considered 
as the leverage at the clean data set yielded from the MCD procedure. For the 
rest of the observations (x^, i/i) the new leverages now are 

h* = xf (X^X,.)-ix„ for ^ = 1, . . . , k. 

Given that the high leverage identical outliers have been removed from the clean 
matrix X^. due to the MCD procedure, the new leverage values h* of the masked 
points will appear larger than the original hi. Thus, in order to overcome the 
masking problem we transform the adjusted residuals as follows: 

a* = y i " ^ (12) 

' VT^i' ^/T^ yr^' 

Based on the reliable results of the MCD procedure, the new robust adjusted 
residual in (fT2)l is a good diagnostic criterion for all observations even for the 
multiple high leverage outliers. Incorporating the new robust adjusted residual 
now in the PTS estimator, we can see how the penalties get individualised for 
each observation since an observation now will be considered as an outlier if 

a* > (ca) =^ > cy/l — h* a, 
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that is the penalty for each observation is defined as 



(13) 



where c is the cut-off parameter, a the robust scale while h* the robust leverage. 

Thus, the PTS estimator tends to remove the masked high leverage observa- 
tions from the data set, unless their residuals are very small. However, according 
to the above analysis, good high leverage points may also appear as extreme 
points with large leverages h* and the yielded adjusted residual threshold may 
be too small. As a consequence, some of the good high leverage points may 
be deleted from the data set. This affects only the efficiency of the solution. 
However, the efficiency of the final estimate is improved by testing each poten- 
tial outlier one by one in the reinclusion stage of the PTS procedure which is 
described in the next section. 



3.3 Reinclusion of Observations 

In order to improve the efficiency of the ak all observations with small studen- 
tized residuals are reincluded into the clean subset of size k. We follow the 
reinclusion test for each of the n ^ k observations, similarly to Hadi and Si- 
monoff [H] and Pena and Yohai [17] . All the n — k points deleted initially stage 
are tested one by one using the studentized predicted error for outlyingness 

U='-^^4^^ ^ + l<.<n, 

where hi — x^[X|^X/j]~^Xi. Under normality assumptions, ti follows Student's 
t distribution with k degrees of freedom. Therefore, every observation (:x.i,yi), 
is reincluded in the subset of the clean data if ti < ta/2,k- We have found 
empirically that for small samples, ta/2,k = 2 works well. The final PTS estimate 
is the OLS solution of the resulting data set after the reinclusion. 



3.4 Quadratic Programming Formulation 

The PTS estimator as defined in ^ can be stated as a Quadratic Mixed Inte- 
ger Programming (QMIP) problem, by expanding the residuals and adding 0-1 
decision variables, as follows: 

n 

niin ^{rj +5^{caf) 
i=i 

s.t. xf /3 + r,; > - (14) 

xf /3 ~ n < yi + Si 
s^ < 6,M 

S^ G {0, 1} 

Ti, Sj > i = 1, . . . ,n. 
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where Si can be regarded as the pulhng distance for moving an outher towards 
the regression Une, 6i is a decision variable which takes the value of 1 if observa- 
tion i is to be removed from the clean data and otherwise, and M is an upper 
bound on the residuals r^, i = 1, . . . , n. Given any fixed S € {0, 1}", from the 
2" possible ones, and using matrix notation problem (|14p becomes 

mm r r + o p 

s.t. X/3 + r > y - s 
X/3 - r < y + s 
s<dM 
r,s > 0, 

where p = {{ca)'^ , {caf , . . . , {ca^y , r = (ri, . . . , r„)^, s = (si, . . . , s„)^, 
y = (yi, . . . ,y„)^, /3 = (/3i, . . . ,/3„)^ and the matrix X = [xi,X2, . . . ,x„]^. 
This problem has linear constraints and a convex quadratic objective function, 
since the Hessian of r-^r has nonnegative eigenvalues (and it is therefore pos- 
itive semi-definite). Therefore we have a convex program, which will have a 
unique global optimum solution according to the Karush-Kuhn- Tucker optimal- 
ity conditions [S]. Considering that there is a finite number of possible S, we 
can conclude that under the assumption that the data are in general position, 
problem or equivalently ^ has a unique solution. 

Due to the high computational complexity of the resulting QMIP problem 
the PTS estimator as defined by cannot be solved exactly in reasonable 
time, even for moderately sized problem instances (i.e. n < 50). This is because 
the computational time required by even the state of the art integer solvers 
such as CPLEX, grows exponentially with respect to the size of the problem 
instance. Moreover it is almost certain (unless P = NP) that there does not 
exist an exact polynomial time algorithm that solves (|14p. Therefore the need 
arises for a specialised approximation algorithm, which we present in the next 
section. 

4 Fast-PTS 

Let us provide a combinatorial formulation of problem (I14p in order to facilitate 
the discussion of the algorithm. Observe that Si = 1 implies ri — 0, while for 
(5i = we have that — r{f3)i which is the residual error of observation i with 
respect to the regression vector (3. If we let the set of observations be O we 
could equivalently formulate problem pij) as follows: 

min L(r):=^r(/3T)?+ ^ K (15) 

iST iGO/T 

where pi is the associated penalty for observation i defined in (I13p , and (3t is 
the OLS solution for the set T of observations uniquely defined as 

/3t := (X?XT)-^X?yT, (16) 
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where Xr,yT are submatrices of X and y respectively, whose rows are indexed 
by the observations in T. The equivalence of problem with p4)) stems from 
the one-to-one correspondence between (3t and T C O, as it is defined by the 
normal equations in p6p . The following necessary conditions for optimality can 
be stated. 

Proposition 4.1. IfTCOisan optimal solution to il5]) then 

i) r{f3Tf,<p,, \fieT 
ii) r{(3Tf,>p^, VieO/T 

Proof. For the case i) let us define the set 

5i := e T : r(/3T)? > P^} 

and assume that 5*1 7^ 0. Then 

L(r)=^r(/3TX'-|- J2 > E ^('3tX'+ E 

leT ieO/T ieT/Si ieo/{T/Si} 

> E ^(^T/S^)^+ E P^ 

ieT/Si ieo/{T/Si} 

therefore we have found a set T/Si were L{T) > L{T / Si) which is a contradic- 
tion to the hypothesis. For the case ii) analogously define the set 

^2 := e 0/T : r(/3T) ■ < Pi] 
and assume that 6*2 7^ 0. Then 

L{T) = ^if^Tf. + p^ > + E ^(/^t).? E p^ 

i£T i^O/T ieT ieSa ieO/{T\jS2} 

> E ^(/3tusJ-+ J2 P^ 

ieTUS2 ieO/{TuS2} 

which implies that L{T) > L{T U 5*2) and again it contradicts the original 
hypothesis of the optimality of T. Therefore 6*1 = ^2 = and the proposition 
is proved. □ 

It can be easily verified by a counterexample that the above conditions are 
not sufficient. Nevertheless the number of local optima which they characterise 
seems to be much smaller than that characterised by more classic /c-exchange 
neighbourhoods which grow exponentially with respect to the size of the associ- 
ated QMIP. Specifically, if we define the k-exchange neighbourhood of a feasible 
solution T to as 

N{T)k -.= {3 CO: \TAS\ < k} 
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procedure Fast-PTS(X, y, p, Maxlter, RandomSeed) 

1 T,nin {1, ■ • • ,"}; 

2 do fc = 1, . . . , Maxlter 

3 T Construction (X, y, p, RandomSeed) ; 

4 T := LocalSearch(X, y, p, D ; 

5 if (L(r) < L(T™„)) ^ T™„ T 

6 od; 

7 return(/3T„,i„ ) 
end Fast-PTS; 



Figure 1: The Fast-PTS algorithm 



where A stands for the symmetric difference of sets, then we have the trivial 
necessary conditions for optimality 

T optimal to (US]) ^ L{T) < V5 G N{T)k. (17) 

A local search algorithm based on pT|) will have to have to search the k neigh- 
bourhood of a feasible solution entirely, which will require an exponential num- 
ber of steps on k with either a breadth-first or a depth-first search strategy. 
Therefore even for small fc = 1 , 2 this type of local search is computationally 
very expensive. 

The first necessary optimality condition of Proposition 14.11 is used to con- 
struct a feasible solution in a randomised fashion, while the second is used to 
obtain the local optimum in the region defined by this solution. 

4.1 The Algorithm 

The Fast-PTS algorithm is presented in pseudocode in Figurc[TJ It accepts as in- 
puts the problem instance data, and the parameters Maxlter and RandomSeed 
which specify the maximum number of iterations and a seed for the random 
number generator respectively. Initially the best solution is Tmin is set to be 
all the observations, while this solution is updated in line 5 of Figure [1] Each 
iteration of the Fast-PTS algorithm is composed of two phases, a construction 
phase where a good solution is built in a greedy randomised fashion starting 
from an empty solution, and an improvement phase where the solution from 
the previous phase is improved to ensure local optimality. At the end of each 
iteration an approximate solution to (fT5| is provided. The maximum number 
of iterations is provided by the user, and the best solution found among all 
iterations is returned by the algorithm as the approximation to the optimum 
solution. This randomised procedure for constructing a solution before perform- 
ing local search, has substantial experimental evidence of good performance for 
NP-Hard combinatorial optimisation problems (see [18]) since it helps the algo- 
rithm to examine a wider area of the feasible space without getting entrapped 
in local optima. 
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The construction phase of the algorithm is shown in Figure ^ Let us call a 
partial solution of say T C O penalty free if 

In the construction phase of the algorithm, initially a partial solution T C O is 
constructed by choosing at random (p + 1) observations such that T is penalty 
free (line 1 in Figure [2]). Note that any p observations are penalty free since 
they have zero residuals i.e. we have an exact fit. Then this solution is enlarged 
in a greedy randomised fashion maintaining its penalty free property. Given 
any partial solution T the following set of possible candidates is defined 

C{T) {j e 0/T : r(/3Tu,)? < P^.. G T U j} 

The elements of C(T) are then sorted in ascending order according to their 
objective function values as given in ([7]), that is 

HTUji) <L{TUj2) < ••• <i(TUj|c(T)|), 

and one observation s is chosen randomly among the first max{l, q;|C(T)|} can- 
didates, and the partial solution is updated (i.e. T :— TU {s}). The parameter 
a G [0,1] controls the degree of greediness versus randomness in the construc- 
tion of the solution. If a = 1 the the algorithm constructs the solution in a 
randomised fashion while if a = the solution is purely greedy. This proce- 
dure is repeated until we reach a partial solution T with C{T) = 0, where the 
construction phase ends. 

In lines 2-16 of the Construction procedure shown in Figure [2] the main 
iterations upon which a solution Tc is constructed, are shown. The iterations 
in lines 4-10 compute the set C{Tc) of candidate elements to be added in the 
solution, where the penalty free property is maintained in line 5. The cost as 
well as the index of the associated candidate element are added into a heap in 
line 8, which will be later used for sorting these values. Finally in lines 11- 
15, an element jc is chosen at random among the best candidates, and added 
into our current solution. The procedure outheapO returns the index j of the 
observation with the smallest L{Tc U j) value as computed in and inserted into 
the heap in lines 7 and 8. 

The solution Tc computed by the construction phase is penalty free, and we 
also know that there is no j e 0/Tc such that Tc U {j} is penalty free with 
respect to PxcU{j}- However there may be an observation j E O/Tc such that 
its residual with respect to (3t^ is less than its respective penalty. That is, the 
necessary optimality conditions stated in Proposition 14.11 mav not be satisfied. 
Therefore further improvement of the solution provided by the construction 
phase could be achieved. This is performed in the improvement phase of the 
Fast-PTS algorithm, where the solution by the construction phase is changed 
to satisfy both necessary optimality conditions. Given a solution Ti and its 
associated regressor vector fixi , a new solution T2 is computed as 

T2 := {z e O : r(/3Tj,' < Pj, 
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procedure Construction(X, y, p, rt,p, RandomSeed) 



1 Tc := CoiistructRaiidom(X, y, p, n,p, RandomSeed); 

2 while {C{Tc) ^ 0) ^ 

3 C{T,) := 0; 

4 do j G 0/Tc 

5 if (r(/3T.uj)? <P., ViGT.Uj)^ 

6 C(r,) :=C(re)Uj; 

7 L{TcUj) := E»fETeUj^(/3TeUj)f + 'E.eo/T.ujP^'^ 

8 inheap(L(reU 

9 endif; 

10 od; 

11 t =random[l,a|C'(rc)|]; 

12 doj-l,...,t^ 

13 jc outheapO; 

14 od; 

15 T, := T, U ; 

16 endwhile; 

17 return(rc) 



end Construction; 



Figure 2: The Construction procedure of the Fast-PTS algorithm 

and this is repeated until Ti = T2. As it is shown in the proof of Proposition 
14.11 at each iteration L{Ti) > L{T2) therefore the procedure will terminate in a 
finite number of steps (usually the convergence is in the order of 4 to 5 steps). 
This simple local search procedure is depicted in Figure [3] 

Overall the computational complexity of the Fast-PTS algorithm is poly- 
nomial on the size of the problem, since for a fixed number of iterations, each 
iteration requires a polynomial number of solutions to an OLS problem, which 
is itself polynomially solvable. 

5 Computational Results 

In this section we present the results of the computational experiments per- 
formed in order to verify the performance of the Fast-PTS algorithm, and the 
overall robust behaviour of the proposed estimator. First we compare the so- 
lutions found by the Fast-PTS algorithm for the QMIP problem with the 
exact solutions found by ILOG's CPLEX software, on a set of artificial problem 
instances. Then we compare the robust behaviour of the algorithm with two of 
the most succesfuU algorithms in robust regression i.e. the Fast-LTS algorithm 
as implemented by Rousseeuw and Van Driessen [21] and the Fast-S algorithm 
as presented by Barrera and Yohai |25| . These comparisons were performed on 
the following sets of instances: 
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procedure LocalSearch(X, y, p, n, p, T) 

1 Ti := T; 

2 while (T2 ^Ti) ^ 

3 Ti T2; 

4 T2 0; 

5 do i — 1, . . . , n ^ 

6 if (r(/3Tjf < P,) 

7 T2 T2 U 

8 endif; 

9 od; 

10 endwhile; 

11 return(ri); 
end LocalSearch; 



Figure 3: The LocalSearch procedure of the Fast-PTS algorithm 



• a set of "benchmarlc" data sets to regression outlier problems widely known 
in the literature, 

• a set of Monte Carlo generated data sets, proposed by Barrera and Yohai 
in [25], 

The Fast-PTS algorithm was implemented on Intel's Fortran Compiler ver- 
sion 10.1 for Linux with the optimisation flags -ipo, -03, -no-prec-div, 
-static, and ILOG's CPLEX integer programming solver version 9.1 was used. 
All computational experiments where performed on a Intel Core Duo 2.4GHz 
computer with 2GB of memory running openSUSE Linux 10. ^I^. 

5.1 Exact Solutions 

In Table [1] we can see the performance of the Fast-PTS algorithm with respect 
to exact solutions for 8 problem instances. In each instance the problem was 
solved exactly by the CPLEX mixed integer quadratic programming solver, and 
the Fast-PTS obtained the exact solution for all instances. The number of max- 
imum iterations was set to 100, although in all cases the optimal solution was 
obtained in fewer iterations. As we can see from Table [1] the computational 
effort required for the exact solution grows exponentially with respect to the size 
of the instance, while the Fast-PTS algorithm requires a fraction of a second. 
Similar comparison would be desirable for larger size instances such as the ones 
presented in section [5T3l This however is infeasible since the required CPU time 
would be excessively large for the exact solution, and the size of the branch and 
cut tree generated by CPLEX exceeds the available memory capacity. Specifi- 
cally for the problem set in Table[l] for instances with n > 128, CPLEX required 

^The code is available upon request at pitsouliagen.auth.gr 
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larger memory for the tree and terminated the solution process. 









Exact 


CPLEX 


Fast-PTS 


data set 


n 


P 


solution 


(CPU seconds) 


(CPU seconds) 


PTSl 


48 


2 


12642.9 


6.05 


0.038 


PTS2 


58 


2 


25910.7 


5.68 


0.048 


PTS3 


68 


2 


17658.1 


146.4 


0.064 


PTS4 


78 


2 


26158.8 


42.3 


0.076 


PTS5 


88 


2 


37951.5 


1368 


0.096 


PTS6 


98 


2 


41289.2 


2256 


0.092 


PTS7 


108 


2 


48519.3 


32055 


0.132 


PTS8 


118 


2 


52829.8 


37948 


0.152 



Table 1: Comparison of Fast-PTS with CPLEX 



5.2 Benchmark Instances 

Five widely used benchmark instances were examined in this experimental study. 
The first four are taken from Rousseeuw and Leroy [22] and have been studied 
by many statisticians in robust literature. The last one is an artificial example 
from Hadi and Simonoff [TT] . Table [2] gives the corresponding results for the 
PTS estimator, indicating the name of the data set, its dimension and number 
of observations, the included outliers, the percentage of outliers that have been 
identified and the running times in seconds. The penalties for deleting outliers 
are (2.^/1 — h* ■a)'^, as have been proposed in section^ These results are similar 
to those reported in the literature for other high breakdown estimators and they 
indicate that the proposed method PTS behaves reliably on these test sets. 

Telephone Data. The first data set contains the total number of telephone 
calls made in Belgium during the years 1950-1973. During the period 1964-1969, 
(cases 15-20), another recording system was used, hence these observations are 
unusually high while cases 14 and 21 are marginal. The outliers draw the OLS 
regression line upwards, masking the true outliers, while swamping in the clean 
cases 22-24 as too low. The high breakdown estimators Fast-LTS and Fast-S 
correctly flags the outliers. Also, our PTS estimator correctly identify the true 
outliers. 

Hertzsprung- Russell Stars Data. This data set contains 47 stars in the 
direction of Cygnus, where x is the logarithm of the effective temperature at the 
surface of a star and y is the logarithm of its light intensity. Four stars, called 
giants, have low temperature with high light intensity. These outliers are cases 
11, 20, 30 and 34 which can be clearly seen by a scatter plot. The three high 
breakdown methods identify successfully the outliers. 

Modified Wood Gravity Data. This is a real data set with five in- 
dependent variables. It consists of 20 cases; some of them were replaced by 
Rousseeuw [12] to contaminate the data by few outliers, namely cases 4, 6, 8 
and 19. Once again, all three methods manage to reveal the outliers. 
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Hawkins, Bradu and Kass Data. The data have been generated by 
Hawkins et. al [13] for illustrating the merits of a robust technique. This 
artificial data set offers the advantage that at least the position of the good or 
bad leverage points is known. The Hawkins, Bradu and Kass data consists of 
75 observations in four dimensions. The first ten observations form a group of 
identical bad leverage points, the next four points are good leverage while the 
remaining are good data. The problem in this case is to fit a hyperplane to the 
observed data. Plotting the regression residuals from the model obtained from 
the standard OLS estimator, the bad high-leverage point data are masked and 
do not show up from the residual plot. Some robust methods not only fail to 
identify the outliers, but they also swamp in the good cases 11-14. The Fast- 
LTS and Fast-S estimators correctly flag the outliers. For the PTS estimator, 
the MCD procedure reveals the first 14 points of this data set as high leverage 
points. Application of the PTS to these data, starting with robust scale estimate 
about a — 0.61 and downweighting the penalty cost with penalties from ()13p . 
rejects only the first 10 points as outliers, which are known as the bad leverage 
points. 

Hadi Data. These data have been created by Hadi and Simonoff 11 , in 
order to illustrate the performance of various robust methods in outlier identi- 
fication. The two predictors were originally created as uniform (0, 15) and were 
then transformed to have a correlation of 0.5. The depended variable was then 
created to be consistent with the model y — x\ ^ X2 + u with u ~ iV(0, 1). 
The first 3 cases (1 — 3) were contaminated to have predictor values around 
(15, 15), and to satisfy ?/ = xi + a;2 + 4. Scatterplots or diagnostics have failed 
to detect the outliers. Many identification methods fail to identify the three 
outliers. Some bounded infiuence estimates have largest absolute residual at 
the clean case 17, indicating potential swamping. The efficient high breakdown 
methods Fast-LTS and Fast-S do identify the three outliers as the most outlying 
cases in the sample, but the residuals are too small to be considered significantly 
outliers. In contrast, robust methods proposed by Hadi and Simonoff [TT] and 
Fast-PTS estimator identify correctly the clean set 4 — 25, with each of the cases 
1 — 3 having residuals greater than 3.78. 



data set 


n 


P 


outliers 


%outliers 
identified 


time 
(CPU seconds) 


Telephone 


24 


2 


15,16,17,18,19,20 


100 


0.05 


Stars 


47 


2 


11,20,30,34 


100 


0.18 


Wood 


20 


6 


4,6,8,19 


100 


0.06 


Hawkins 


75 


4 


1,2,3,4,5,6,7,8,9,10 


100 


0.50 


Hadi 


25 


3 


1,2,3 


100 


0.08 



Table 2: Performance of the Fast-PTS algorithm on some small data sets. 
5.3 Monte Carlo Simulation 

To explore further the properties of the PTS method, we have performed an 
extensive set of simulation experiments for larger sample sizes and observation 
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dimensions taken from Barrera and Yohai [25] . The experiments compare the 
performance of our Fast-PTS algorithm with the results of Fast-LTS and Fast-S 
algorithms. 

Barrera and Yohai [53] considered a model as in ^ with x = (1, X2, • ■ ■ , Xp), 
where a proportion (1 — e) in the data samples x = (xi, a;2, . . . , Xp, y) follow a 
multivariate normal distribution. Without loss of generality the mean vector is 
taken equal to and the identity as the covariance matrix. These observations 
follow the model in ^ with /3 = 0. The contaminated observations are high 
leverage outliers with x = (1, 100, 0, . . . , 0) and y = slope x 100, where the slope 
of contamination takes values from 0.9 to 2.0. 

Under high leverage contaminations, the objective functions of the three 
high breakdown estimators typically have two distinct types of local minimum, 
that is one close to the true value of the regression parameter and a second one 
close to the slope of the outliers. 















slop^ 


e 












method 


0.9 


1.0 


1.1 


1.2 


1.3 


1.4 


1.5 


1.6 


1.7 


1.8 


1.9 


2.0 


Fast-PTS 


58 


44 


21 


8 


3 


2 


1 


1 


1 











Fast-S 


88 


60 


29 


8 


1 























Fast-LTS 


100 


100 


95 


86 


67 


42 


21 


9 


4 


1 









Table 3: Percentage of samples where convergence occured to the wrong local 
minimum, n = 400, p = 36 and 10% of outliers located at (xi, . . . ,X36,2/) = 
(l,100,0,...,0,s;opex 100). 



Two measures of performance are used in this Monte Carlo study to compare 
the three estimators: 

• The percentage of samples for which each algorithm converged to a wrong 
local minimum i.e. a local minimum close to the slope of the outliers. 

• The mean square error (MSE) of the parameter estimate 0. 

Several different values of the slope of contamination were used to determine 
the behaviour of the three algorithms. Its range varies from 0.9 to 2.0 with 
sample sizes of n = 400 and p ^ 36. The proportion of outliers e was set to 
10%. 

Tables [3] and [4] contain the percentage of convergence to a wrong local min- 
imum and the MSEs respectively. From Table [3] we observe that the Fast-PTS 
has the lowest wrong convergence with slope between 0.9 and 1.1. For larger 
slope the results are comparable for all estimators. In Table 2] it is shown that 
the Fast-PTS has the lowest MSE for the contaminated data with slope between 
0.9 and 1.9. For the rest of the data, the MSE of the Fast-PTS is comparable 
with that of Fast-LTS and Fast-S. The main conclusion from Tables |3] and [4] 
is that for contamination with small slope, the Fast-PTS perform better than 
Fast-S and Fast-LTS in percentage of wrong convergence. However, for larger 
slopes, the PTS has important improvement in the MSEs. 

To explore further the comparison of the three estimators, we use again the 
same data as Barrera and Yohai [25j . with different values of n, p, considering 
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slope 



method 


0.9 


1.0 


1.1 


1.2 


1.3 


1.4 


1.5 


1.6 


1.7 


1.8 


1.9 


2.0 


Fast-PTS 


1.54 


1.30 


0.82 


0.48 


0.32 


0.29 


0.36 


0.27 


0.31 


0.30 


0.33 


0.48 


Fast-S 


1.65 


1.40 


0.98 


0.62 


0.48 


0.47 


0.45 


0.45 


0.47 


0.46 


0.46 


0.46 


Fast-LTS 


1.73 


1.46 


1.39 


1.34 


1.08 


0.92 


0.74 


0.58 


0.51 


0.47 


0.46 


0.46 



Table 4: Mean Squared Errors, n — 400, p — 36 and 10% of outliers located at 
(xi,...,x36,y) = (l,100,0,...,0,sZopex 100). 

only one value of the contamination slope, y = 100 x 1 and e = 10%. Tables [5] 
and [6] show the average time needed until convergence of the algorithm, the 
percentage of samples where the algorithm converged to a wrong local minimum 
("%wrong") and the mean squared error ("MSE" ) for the three estimators. 
From Table [5] we observe that the Fast-PTS performs significantly better in 
both performance criteria especially when the data set is of size n — 100 or 
n = 500. 









Fast-PTS 




Fast-S 


Fast-LTS 


n 


P 


CPU time 


%wrong 


MSE 


%wrong 


MSE 


%wrong 


MSE 


100 


2 


0.08 


0.2 


0.030 


30.6 


0.370 


50.8 


0.692 




3 


0.09 


0.2 


0.052 


37.2 


0.506 


56.0 


0.890 




5 


0.12 


0.2 


0.099 


46.8 


0.773 


64.8 


1.193 


500 


5 


2.28 


0.0 


0.014 


15.4 


0.193 


61.4 


0.803 




10 


3.78 


0.0 


0.029 


19.8 


0.292 


65.8 


0.935 




20 


8.31 


0.2 


0.069 


30.4 


0.547 


79.4 


1.207 


1000 


5 


8.75 


0.0 


0.007 


7.8 


0.095 


61.2 


0.715 




10 


14.41 


0.0 


0.014 


5.4 


0.089 


69.4 


0.859 




20 


31.11 


0.0 


0.028 


12.4 


0.208 


82.4 


1.158 



Table 5: Samples with 10% of outliers located at (xi, . . . , Xp, y) = 
(1,100,0,..., 0,100). 

For heavier contamination e — 0.20 and slope 2.2, Table [6] shows the results 
of these simulations. As in Table \5\ we observe that the Fast-PTS has the best 
overall performance especially for the small data sets n = 100. However, for 
larger data sets {n > 500, the Fast-PTS and Fast-S perform comparable, since 
the outliers are easier to detect by both robust estimators due to the large slope 
value. 

To continue the implementation of the Fast-PTS algorithm and compare it 
with the well-known methods discussed in this article, we perform extra Monte 
Carlo experiments to evaluate the performance of our robust procedure. To 
carry out one simulation run, we proceeded as follows. The distributions of 
independent variables and errors and the values of parameters are given. The 
observations yi, were obtained following the regression model as in ([1]) with 
p = 3 and n = 50, where the coefficient values are (3i = 1.20, f32 = —0.80 and 
a zero constant term f3o — 0.0. We prefer the Gauss distribution for the iid 
error term u ~ A'^(0, cr^ = 16^), while xi and X2 are iid values drawn also from 
normal distributions N{fi — 20, tr^ = 6^) and N{fi = 30, cr^ = 8^) respectively. 
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Fast-PTS 




Fast-S 


Fast-LTS 


n 


P 


CPU time 


%wrong 


MSB 


%wrong 


MSE 


%wrong 


MSE 


100 


2 


0.08 


0.2 


0.032 


10.0 


0.543 


40.9 


2.333 




3 


0.09 


0.0 


0.056 


15.6 


0.947 


51.6 


3.218 




5 


0.11 


0.8 


0.109 


34.0 


2.303 


70.2 


5.241 


500 


5 


2.28 


0.0 


0.016 


0.0 


0.025 


23.4 


1.236 




10 


3.74 


0.0 


0.032 


0.0 


0.076 


33.8 


2.102 




20 


7.82 


11.2 


0.873 


9.2 


0.734 


72.2 


4.276 


1000 


5 


9.40 


0.0 


0.008 


0.0 


0.012 


11.0 


0.542 




10 


14.94 


0.0 


0.016 


0.0 


0.025 


20.4 


1.205 




20 


31.63 


0.0 


0.030 


0.0 


0.055 


49.8 


2.980 



Table 6: Samples with 20% of outliers located at (xi, . . . ,Xp, y) = 
(1,100,0,..., 0,220). 

We consider that the sample may contain three types of outliers, regression 
outhers ("bad" high-leverage points), "good" high-leverage points, and response 
outliers (y-outliers) . An extra value is drawn from the uniform distribution 
U{a = 80,6 = 220) and for the regression outlier is added to xi or X2, for 
the "good" leverage point is added to xi or X2 but the value of the dependent 
variable y follows their contamination, according to the above regression model, 
for the response outlier is added to y. All simulation results are based on 150 
replications enough to obtain a relative error < 10% with a reasonable confidence 
level of at least 90% for all the simulation estimates. The robust scale estimate 
a from LTS with coverage fc = 28 is used throughout the simulation study. 

Tables [7] and [5] display the results concerning the performance of the three 
robust estimators corresponding to the following cases: Table [7] is based on 
data contaminated by "bad" and "good" high leverage points whereas Table [8] 
is based on data contaminated by "bad" high leverage outliers (heavier contam- 
ination). 





Fast-PTS 


Fast-S 


Fast-LTS 


%wrong 


2.7 


13.3 


16.0 


mean estimate of $o 


-1.796 


-1.633 


-2.782 


mean estimate of /3i 


1.176 


1.129 


1.147 


mean estimate of 02 


-0.766 


-0.758 


-0.732 


mean squared error of /3o 


35.38 


58.41 


112.21 


mean squared error of $i 


0.036 


0.106 


0.132 


mean squared error of f^2 


0.015 


0.030 


0.049 


computation time (sees) 


0.56 


0.29 


0.18 



Table 7: Samples with 32% of outliers (a;-outhers=6, "good" leverage points=4, 
y-outliers=6), n = 50, p = 3. True: (3o = 0.0, /3i = 1.20, P2 = -0.80. 



6 Conclusions and Future Research 

The PTS estimator can be considered as a procedure which instead of improving 
upon previous robust estimators such as LTS and MCD, it combines the infor- 
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Fast-PTS 


Fast-S 


Fast-LTS 


%wrong 


26.0 


61.3 


50.0 


mean estimate of 0o 


-3.955 


-1.607 


-2.123 


mean estimate of $i 


1.131 


0.814 


0.910 


mean estimate of $2 


-0.646 


-0.569 


-0.600 


mean squared error of $0 


84.41 


158.25 


173.98 


mean squared error of $1 


0.088 


0.473 


0.390 


mean squared error of (32 


0.084 


0.167 


0.154 


CPU time 


0.55 


0.27 


0.16 



Table 8: Samples with 32% of outliers (a;-outliers=10, y-outliers=6), n = 50, 
p = 3. True: /3o = 0.0, /3i = 1.20, P2 = -0.80. 

mation so obtained from them in a penalized manner, so as to simultaneously 
retain their favorable robust characteristics, while being efficient and identifying 
multiple masked outliers. Specifically, this is achieved by using a robust scale 
and leverage from the LTS and MCD respectively. Moreover, an efficient fast 
algorithm which is based on a set of necessary conditions for local optimality is 
also presented. Extensive computational experiments on a large set of instances, 
indicate that the proposed estimator ouperforms other robust estimators in the 
presence of all possible types of outliers. 

Future research can be performed on both the estimator and the algorithm. 
With respect to the estimator, a different robust scale & could be investigated, 
and also it can be extended to treat data sets which contain a mixture of mul- 
tilinear regression models. The algorithm could also be modified to handle 
massive data sets, which may have millions of observations, by utilizing the 
natural parallel structure of the algorithm. 
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